
Conversation

@jonastheis
Contributor

@jonastheis jonastheis commented Oct 30, 2025

This PR implements a self-healing gap recovery mechanism for L1 messages and batch events. The actual gap detection happens in the ChainOrchestrator, which then notifies the L1Watcher that it needs to reset.

Specifically, the following changes are implemented:

  • detect gaps and duplicates in L1 messages and batch commit events
  • handle detected gaps in the ChainOrchestrator and reset the L1Watcher
  • implement a command receiver in the L1Watcher so it can be reset to a given sync height (a rough sketch of this flow follows the list)
    • making sure that there is no deadlock, since the L1Watcher blocks when its send channel is full
  • handle the edge case of a missing L1 message for a commit batch: such batches are retried in the derivation pipeline, so once the missing L1 message is recovered the batch can be processed
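
Below is a rough sketch of the reset flow. It is illustrative only: the L1WatcherCommand enum, field names, and channel capacities are assumptions rather than the PR's actual code. The key idea is that the reset command carries a fresh notification sender and the orchestrator switches to the matching receiver, which avoids the deadlock noted above when the old channel is full.

use std::sync::Arc;
use tokio::sync::mpsc;

/// Stand-in for the real `L1Notification` type.
#[derive(Debug)]
pub enum L1Notification {
    NewBlock(u64),
}

/// Command sent from the `ChainOrchestrator` to the `L1Watcher` (assumed shape).
pub enum L1WatcherCommand {
    /// Reset the watcher to `block_number` and switch notifications over to
    /// `new_notification_tx`, so a send blocked on the old, full channel cannot stall the reset.
    Reset { block_number: u64, new_notification_tx: mpsc::Sender<Arc<L1Notification>> },
}

/// Handle held by the orchestrator to control the watcher.
#[derive(Clone)]
pub struct L1WatcherHandle {
    commands: mpsc::Sender<L1WatcherCommand>,
}

impl L1WatcherHandle {
    /// Ask the watcher to rewind to `block_number` and return a fresh notification receiver.
    pub async fn reset(&self, block_number: u64) -> mpsc::Receiver<Arc<L1Notification>> {
        let (tx, rx) = mpsc::channel(1000);
        let _ = self
            .commands
            .send(L1WatcherCommand::Reset { block_number, new_notification_tx: tx })
            .await;
        rx
    }
}

/// Skeleton of the watcher side: the command receiver is polled alongside regular indexing work.
pub async fn watcher_loop(
    mut commands: mpsc::Receiver<L1WatcherCommand>,
    mut notifications: mpsc::Sender<Arc<L1Notification>>,
) {
    let mut current_block = 0u64;
    loop {
        tokio::select! {
            cmd = commands.recv() => match cmd {
                Some(L1WatcherCommand::Reset { block_number, new_notification_tx }) => {
                    // Rewind and start sending on the replacement channel.
                    current_block = block_number;
                    notifications = new_notification_tx;
                }
                None => break,
            },
            // Placeholder for the real indexing work: fetch the next L1 block and notify.
            _ = tokio::time::sleep(std::time::Duration::from_secs(1)) => {
                current_block += 1;
                if notifications.send(Arc::new(L1Notification::NewBlock(current_block))).await.is_err() {
                    break;
                }
            }
        }
    }
}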

@codspeed-hq

codspeed-hq bot commented Oct 30, 2025

CodSpeed Performance Report

Merging #403 will not alter performance

Comparing feat/self-healing-l1-events (6a23c25) with main (e7ba7aa)

Summary

✅ 2 untouched

@jonastheis jonastheis marked this pull request as ready for review November 4, 2025 09:53
@jonastheis jonastheis requested a review from frisitano November 4, 2025 09:53
Collaborator

@frisitano frisitano left a comment

Added some comments inline.

// testing
#[cfg(feature = "test-utils")]
{
    let (tx, rx) = tokio::sync::mpsc::channel(1000);
Collaborator

Can we create an L1 watcher handle and receiver channel here that can be used for testing?
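
For illustration, something along these lines could expose both to tests (the L1WatcherHandle::new constructor and the command channel are assumptions here, not the crate's actual API):

// testing
#[cfg(feature = "test-utils")]
{
    // Sketch only: assumed constructor and command channel.
    let (notification_tx, notification_rx) = tokio::sync::mpsc::channel(1000);
    let (command_tx, command_rx) = tokio::sync::mpsc::channel(16);
    let l1_watcher_handle = L1WatcherHandle::new(command_tx);
    // Tests keep `l1_watcher_handle` and `notification_rx`; the watcher under
    // test keeps `notification_tx` and `command_rx`.
}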

async fn get_batch_by_index(
    &self,
    batch_index: u64,
    processed: Option<bool>,
Collaborator

What's the purpose of adding the processed filter?

Contributor Author

Thought I needed it at some point. Reverted.

/// New sender to replace the current notification channel
new_sender: mpsc::Sender<Arc<L1Notification>>,
/// Oneshot sender to signal completion of the reset operation
response_sender: oneshot::Sender<()>,
Collaborator

Why is this needed?

Contributor Author

Not really needed. Removed.

/// This trait allows the chain orchestrator to send commands to the L1 watcher,
/// primarily for gap recovery scenarios.
#[async_trait::async_trait]
pub trait L1WatcherHandleTrait: Send + Sync + 'static {
Collaborator

What value does a trait add here as opposed to using a concrete type? Do we intend the handle to be generic in some way?

Contributor Author

Yeah, thought I'd need it for testing. Removed now.

Comment on lines 90 to 93
pub struct MockL1WatcherHandle {
    /// Track reset calls as (`block_number`, `channel_capacity`)
    resets: Arc<std::sync::Mutex<Vec<(u64, usize)>>>,
}
Collaborator

Why do we need this? Can't we just inspect the receiver channel directly? I think we would then be able to remove MockL1WatcherHandle and the L1WatcherHandleTrait and just use the L1WatcherHandle directly. I think this would result in simpler code.
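
For illustration, the kind of direct assertion this would allow in tests (the command enum and its reset fields are assumed here):

// Sketch only: inspect the command receiver directly instead of a mock.
match command_rx.try_recv() {
    Ok(L1WatcherCommand::Reset { block_number, .. }) => assert_eq!(block_number, expected_reset_block),
    _ => panic!("expected a reset command after the gap"),
}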

Contributor Author

Yeah, thought I'd need it for testing. Removed now.

/// A receiver for [`L1Notification`]s from the [`rollup_node_watcher::L1Watcher`].
l1_notification_rx: Receiver<Arc<L1Notification>>,
/// Handle to send commands to the L1 watcher (e.g., for gap recovery).
l1_watcher_handle: Option<H>,
Collaborator

Why is this optional?

Contributor Author

Removed the option.

Comment on lines 569 to 599
) {
    Err(ChainOrchestratorError::L1MessageQueueGap(queue_index)) => {
        // Query database for the L1 block of the last known L1 message
        let reset_block =
            self.database.get_last_l1_message_l1_block().await?.unwrap_or(0);
        // TODO: handle None case (no messages in DB)

        tracing::warn!(
            target: "scroll::chain_orchestrator",
            "L1 message queue gap detected at index {}, last known message at L1 block {}",
            queue_index,
            reset_block
        );

        // Trigger gap recovery
        self.trigger_gap_recovery(reset_block, "L1 message queue gap").await?;

        // Return no event, recovery will re-process
        Ok(None)
    }
    Err(ChainOrchestratorError::DuplicateL1Message(queue_index)) => {
        tracing::info!(
            target: "scroll::chain_orchestrator",
            "Duplicate L1 message detected at {:?}, skipping",
            queue_index
        );
        // Return no event, as the message has already been processed
        Ok(None)
    }
    result => result,
}
Collaborator

Why don't we embed this logic inside of handle_l1_message?

Comment on lines 532 to 562
match metered!(Task::BatchCommit, self, handle_batch_commit(batch.clone())) {
    Err(ChainOrchestratorError::BatchCommitGap(batch_index)) => {
        // Query database for the L1 block of the last known batch
        let reset_block =
            self.database.get_last_batch_commit_l1_block().await?.unwrap_or(0);
        // TODO: handle None case (no batches in DB)

        tracing::warn!(
            target: "scroll::chain_orchestrator",
            "Batch commit gap detected at index {}, last known batch at L1 block {}",
            batch_index,
            reset_block
        );

        // Trigger gap recovery
        self.trigger_gap_recovery(reset_block, "batch commit gap").await?;

        // Return no event, recovery will re-process
        Ok(None)
    }
    Err(ChainOrchestratorError::DuplicateBatchCommit(batch_info)) => {
        tracing::info!(
            target: "scroll::chain_orchestrator",
            "Duplicate batch commit detected at {:?}, skipping",
            batch_info
        );
        // Return no event, as the batch has already been processed
        Ok(None)
    }
    result => result,
}
Collaborator

Why don't we embed this logic in handle_batch_commit?

/// # Arguments
/// * `reset_block` - The L1 block number to reset to (last known good state)
/// * `gap_type` - Description of the gap type for logging
async fn trigger_gap_recovery(
Collaborator

If we embed the L1Notification channel inside the L1WatcherHandle, then we can implement this logic on the L1WatcherHandle directly, enabling better encapsulation.
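
For illustration, roughly the shape this suggests, reusing the assumed L1WatcherCommand and L1Notification placeholders from the sketch in the PR description (field and method names are assumptions, not the PR's final code):

/// Sketch only: the handle owns the command sender and the current notification
/// receiver, so gap recovery can swap the channel internally.
pub struct L1WatcherHandle {
    commands: mpsc::Sender<L1WatcherCommand>,
    l1_notification_rx: mpsc::Receiver<Arc<L1Notification>>,
}

impl L1WatcherHandle {
    /// Reset the watcher to `reset_block` and replace the notification channel with a
    /// fresh one, dropping the old (possibly full) receiver.
    pub async fn trigger_gap_recovery(&mut self, reset_block: u64, gap_type: &str) {
        tracing::warn!(target: "scroll::watcher", "triggering gap recovery ({gap_type}) from L1 block {reset_block}");
        let (tx, rx) = mpsc::channel(1000);
        let _ = self
            .commands
            .send(L1WatcherCommand::Reset { block_number: reset_block, new_notification_tx: tx })
            .await;
        self.l1_notification_rx = rx;
    }

    /// The orchestrator keeps reading notifications through the handle.
    pub async fn next_notification(&mut self) -> Option<Arc<L1Notification>> {
        self.l1_notification_rx.recv().await
    }
}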

Contributor Author

Done in dce07df
